Automatic analysis of cochlear response using electrocochleography signals during cochlear implant surgery

Cochlear implants (CIs) provide an opportunity for the hearing impaired to perceive sound through electrical stimulation of the hearing (cochlear) nerve. However, there is a high risk of losing a patient’s natural hearing during CI surgery, which has been shown to reduce speech perception in noisy environments as well as music appreciation. This is a major barrier to the adoption of CIs by the hearing impaired. Electrocochleography (ECochG) has been used to detect intra-operative trauma that may lead to loss of natural hearing. There is early evidence that ECochG can enable early intervention to save natural hearing of the patient. However, detection of trauma by observing changes in the ECochG response is typically carried out by a human expert. Here, we discuss a method of automating the analysis of cochlear responses during CI surgery. We establish, using historical patient data, that the proposed method is highly accurate (∼94% and ∼95% for sensitivity and specificity respectively) when compared to a human expert. The automation of real-time cochlear response analysis is expected to improve the scalability of ECochG and improve patient safety.


Overview
Hearing loss was the fourth most prevalent disease and the third leading cause of 'Years Lost with Disability' in 2016 according to the Global Burden of Disease report [1]. Its treatment can prevent downstream effects such as dementia [2] and socioeconomic disadvantages [3]. Disabling hearing loss affected 6.1% of the world's population in 2017 [4]. A cochlear implant (CI) is a cost-effective solution [5] because it restores a recipient's economic and social independence. But as yet, as few as 5% of adults who may benefit from this technology receive a CI [6]. An important reason for this is that many CI candidates fear losing their residual, natural hearing [7][8][9] as occurs after 50-70% of CI surgeries. If this natural hearing could be reliably preserved during CI surgery, more patients would benefit from the advantages a CI can offer. Cochlear implantation has until recently been a 'blind procedure', where the surgeon was not able to monitor the condition of the inner ear during implantation. Any damage caused during the surgery could only be observed 2-3 weeks after the procedure using conventional hearing tests. However, intra-operative electrocochleography (ECochG) has made it possible to monitor the response of the cochlea to sound during surgery [10]. Previous studies have shown that changes in particular ECochG components can be used to predict the preservation of post-operative residual hearing [11,12]. The process of trauma detection has so far necessitated the presence of a human expert. This requirement for a human expert's presence during cochlear implantation is a significant barrier to the widespread use of intra-operative ECochG monitoring in clinical practice.
The real-time decisions made by a human expert have been demonstrated to predict hearing preservation in observational studies [13], as well as improve rates of hearing preservation when used to initiate intervention during implantation [14]. These results provide us with a validated approach to interventional cochlear implant guidance using electrocochleography. Here, we aim to automate this validated approach by analysing cochlear responses to achieve parity with the decisions made by a human expert, using multiple metrics derived from the ECochG response. We validate this method using historical patient data compared to the performance of a human expert.

Real-time electrocochleography
In real-time ECochG, an acoustic stimulus is delivered to the ear as the CI electrode is being inserted into the cochlea. The acoustic stimulus is presented with alternating polarity, which either reduces (rarefaction) or increases (condensation) air pressure at stimulus onset. Modern real-time ECochG systems typically use the electrode itself as a recording device, rather than an extracochlear electrode, as this improves the signal-to-noise ratio [12]. The electrical potentials recorded are composed of contributions from both sensory (hair) cells and the auditory neurons.
The response of the sensory (hair) cells is estimated by taking the difference of the alternating polarity responses (DIF). This is termed the 'Cochlear Microphonic' (CM) [15,16]. The CM is primarily generated by receptor currents in outer hair cells (OHCs) [17,18]. The amplitude (and phase) of the CM is measured by taking the Fast Fourier Transform (FFT) of the DIF response at the stimulus frequency, as the CM follows the stimulus. The neural response is estimated by taking the sum of the alternating polarity responses (SUM), which cancels frequency-following responses and leaves an estimation of the phase-locked 'Auditory Nerve Neurophonic' (ANN). As this phase-locking occurs preferentially as inner hair cells depolarize [19], it results in an asymmetric response which produces a large component of distortion at twice the stimulus frequency. The amplitude (and phase) of the ANN is therefore measured at the 2nd harmonic of the SUM [20]. Fig 1 shows an example of the signals measured using ECochG. Note that this approach to deriving the CM and ANN will not exclusively derive outer hair cell and neural contributions, and should be considered an estimation.
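The derivation of DIF and SUM described above can be illustrated with a minimal sketch. It assumes that the frequency-following component inverts with stimulus polarity while the asymmetric component at twice the stimulus frequency does not; the function name, the synthetic amplitudes, and the absence of any scaling (such as halving) are assumptions made for illustration and are not taken from the recording system.

```python
import math


def dif_and_sum(rarefaction, condensation):
    """Return (DIF, SUM) for two equal-length responses to opposite-polarity stimuli."""
    if len(rarefaction) != len(condensation):
        raise ValueError("responses must have the same length")
    dif = [r - c for r, c in zip(rarefaction, condensation)]    # estimates the CM
    total = [r + c for r, c in zip(rarefaction, condensation)]  # estimates the ANN
    return dif, total


# Synthetic illustration only (not patient data).
FS = 20_000   # sampling rate in Hz
F0 = 500      # stimulus frequency in Hz
N = 240       # samples in a 12 ms window
times = [n / FS for n in range(N)]
follow = [3.0 * math.sin(2 * math.pi * F0 * x) for x in times]       # inverts with polarity
distort = [0.5 * math.cos(2 * math.pi * 2 * F0 * x) for x in times]  # same for both polarities

rarefaction = [f + d for f, d in zip(follow, distort)]
condensation = [-f + d for f, d in zip(follow, distort)]
dif, total = dif_and_sum(rarefaction, condensation)
```

In this idealised case DIF contains only the 500 Hz component and SUM only the 1 kHz component; in real recordings the separation is an estimate, as noted above.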

Manual detection of trauma
Intra-operative detection of trauma using ECochG has so far focused on amplitude changes of the CM, as it is the most readily detectable ECochG signal [11,12]. For example, a 30% drop in CM amplitude has been shown to predict poorer preservation of natural hearing [11].

Database
In order to develop and validate our automatic trauma detection method, we used the intraoperative ECochG recordings of 77 patients who underwent CI surgery. The 77 patients were drawn from 95 patients who had ECochG recorded in our department over 2017 and 2018; 18 patients were excluded due to low ECochG response amplitudes (<1 μV). A 1 μV threshold for CM amplitude, measured in the DIF signal after a 15th-order digital bandpass filter around the stimulus frequency F (0.9 × F to 1.1 × F), was used, as this is the minimum amplitude required to exceed the noise floor of our ECochG system by 3 standard deviations. In our experience, once this threshold is exceeded, the signal-to-noise ratio is sufficient for us to detect a 30% drop reliably. Written consent was obtained from each patient and data were anonymised to remove all identifying information. Ethics approval for the collection of these data was obtained from the Human Ethics Committee of the Royal Victorian Eye and Ear Hospital (HREC #14/1171H/19).
Our approach to ECochG has been described in detail previously [11]. Briefly, adult subjects with residual low-frequency hearing (hearing threshold ≤ 80 dB HL at 0.5 kHz) receiving Cochlear Limited's Nucleus CI422 or 522 implants were enrolled in this study. ECochG was recorded using the Cochlear Response Telemetry system. The acoustic stimulus was a tone pip with a frequency of 0.5 kHz and a duration of 12 ms with 1 ms linear onset and offset ramps, presented at a rate of 14 per second. The intensity of the acoustic stimulus was 100-110 dB HL. The ECochG response was recorded from the electrode at the tip of the array, in recording windows of 12 ms duration sampled at 20 kHz. Each waveform was averaged from 100 stimulus presentations. Once recorded, the amplitude of the DIF response was taken as its magnitude at the stimulus frequency, here 0.5 kHz. This was calculated using the FFT, by first zero-padding to 1000 samples to give a 20 Hz bin size and then taking the bin at the stimulus frequency.
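The amplitude calculation can be sketched as follows. With a 20 kHz sampling rate, zero-padding to 1000 samples gives 20 Hz bins, so the 0.5 kHz stimulus falls in bin 25. Because the padded zeros add nothing to the sum, the sketch evaluates that single bin directly rather than a full FFT. The function name and the amplitude scaling (2|X|/N, with N the number of recorded samples) are assumptions; the scaling used by the recording software is not stated here.

```python
import cmath
import math


def bin_amplitude(signal, fs, freq, n_fft=1000):
    """Amplitude and phase of `signal` at `freq`, read from one bin of an n_fft-point DFT."""
    if len(signal) > n_fft:
        raise ValueError("signal is longer than the padded length")
    k = round(freq * n_fft / fs)  # bin index: 500 Hz -> bin 25 for 20 Hz bins
    # Zero-padded samples contribute nothing, so only the recorded samples are summed.
    x_k = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft) for n, x in enumerate(signal))
    amplitude = 2.0 * abs(x_k) / len(signal)  # assumed scaling for a single-sided amplitude
    return amplitude, cmath.phase(x_k)


# Synthetic 12 ms, 500 Hz tone of amplitude 2.5 (arbitrary units), sampled at 20 kHz.
FS, F0 = 20_000, 500
tone = [2.5 * math.sin(2 * math.pi * F0 * n / FS) for n in range(240)]
amp, phase = bin_amplitude(tone, FS, F0)
```

For this synthetic tone the window holds exactly six cycles, so the recovered amplitude matches the tone amplitude; onset and offset ramps, which are ignored here, would reduce it slightly.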
An expert in ECochG labelled the data at two levels. First, each recording was labelled as a 'drop' or 'no drop' depending on whether any drops in the ECochG signal were observed or not. Second, in the patients whose recording had drops in the ECochG signal, the duration of each drop (from the time a human expert would start detecting it in real-time to the end of the drop) was identified. Each time point in each identified duration was then labelled as a 'drop' instance and all other time points were labelled as 'no drop' instances.
As drops, by definition, have to be on falling edges of the ECochG signal, all time points where the CM signal was rising or constant were removed from the data, in order to simplify the classification problem. The remaining patient data was randomly separated into 5 subsets ensuring each subset included at least one patient with a drop. These subsets were then used in a 5-fold cross validation for the training and testing of our trauma detection method.
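A minimal sketch of these two preparation steps is given here. It assumes that a falling-edge point is one whose CM amplitude is strictly lower than the preceding reading, and that the constraint on folds is met by distributing the shuffled drop patients across folds first. The function names and patient identifiers are hypothetical, and the actual randomisation procedure may have differed.

```python
import random


def falling_edge_indices(cm):
    """Indices t at which the CM amplitude is strictly lower than at t - 1."""
    return [t for t in range(1, len(cm)) if cm[t] < cm[t - 1]]


def patient_folds(drop_patients, no_drop_patients, k=5, seed=0):
    """Randomly split patients into k folds, each holding at least one drop patient."""
    if len(drop_patients) < k:
        raise ValueError("need at least k patients with a drop")
    rng = random.Random(seed)
    drops = list(drop_patients)
    others = list(no_drop_patients)
    rng.shuffle(drops)
    rng.shuffle(others)
    folds = [[] for _ in range(k)]
    # Drop patients are dealt out first, so the first k of them cover every fold.
    for i, patient in enumerate(drops + others):
        folds[i % k].append(patient)
    return folds


# Hypothetical identifiers, for illustration only.
drop_ids = [f"d{i}" for i in range(7)]
no_drop_ids = [f"n{i}" for i in range(20)]
folds = patient_folds(drop_ids, no_drop_ids)
kept = falling_edge_indices([5.0, 6.0, 6.0, 4.0, 3.5, 3.6])
```

Splitting by patient rather than by time point keeps all readings from one patient in the same fold, which avoids testing on a patient whose data were also used for training.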

Experimental setup
All methods were implemented in MATLAB R2020a. Classifiers included in the Statistics and Machine Learning Toolbox were used in the experiments. All experiments were run on a computer with a 2.11GHz Intel Core i5 processor and 8GB of RAM. The operating system was 64-bit Windows 10.

Features
We used features including those used by human experts in detecting drops and/or shown in previous work to indicate the presence of a drop in the ECochG signal. These features were derived from the CM at the fundamental frequency, 500 Hz, and the corresponding ANN at 1 kHz. The features considered were: the amplitude and phase of the CM and ANN signals, the ratio of the CM and ANN amplitudes, the CM amplitude as a fraction of the immediately preceding peak in the signal, the time elapsed since that peak, and the coefficient of variation (standard deviation / mean) of the CM amplitude in a window comprising the current instance and the 4 instances immediately before it.
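As an illustration of the windowed feature, the coefficient of variation can be computed as in the following sketch. The use of the population standard deviation, and the shortening of the window when fewer than 5 readings are available, are assumptions; the function name is hypothetical.

```python
import math


def coefficient_of_variation(cm, t, window=5):
    """std / mean of the CM amplitude over reading t and up to window - 1 earlier readings."""
    start = max(0, t - window + 1)  # the window is shortened at the start of a recording
    values = cm[start:t + 1]
    mean = sum(values) / len(values)
    # Population standard deviation (divide by n); an assumption, see the text above.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return std / mean


flat_cv = coefficient_of_variation([10.0, 10.0, 10.0, 10.0, 10.0], t=4)
varied_cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9], t=7, window=8)
```

A value near zero indicates a stable signal, whereas larger values indicate that the amplitude is changing within the window.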
In order to determine the 2 peak-related metrics, we developed a real-time algorithm that detected peaks and troughs in an ECochG signal. First, we identified all local peaks and troughs. If, at time t, the amplitude of the CM signal satisfies ‖CM‖_{t−2} < ‖CM‖_{t−1} and ‖CM‖_{t−1} ≥ ‖CM‖_{t}, then ‖CM‖_{t−1} is a peak. Similarly, ‖CM‖_{t−1} is a trough if ‖CM‖_{t−2} ≥ ‖CM‖_{t−1} and ‖CM‖_{t−1} < ‖CM‖_{t}. Each peak thus detected was paired with the trough that occurred immediately before it. Second, if a peak was detected, we determined whether it was an active peak or a local fluctuation. The current peak was considered to be an active peak if:
• there were no active peaks detected so far, or
• the signal at the current peak was larger than the last active peak, or
• the signal at the current peak was larger than s_a times the last active peak, and the amplitude difference between the current peak and its paired trough was larger than s_b times the current peak.
The scalars s_a and s_b were set to 0.5 and 0.1 respectively through trial and error.
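The peak-detection rules above can be sketched as a single pass over the CM amplitude series. The sketch assumes that the two peak-related features are measured against the last active peak and that time is counted in readings; a peak at reading t − 1 only becomes known at reading t, as it would in real time. The function name and return format are hypothetical.

```python
def track_active_peaks(cm, s_a=0.5, s_b=0.1):
    """Return (active_peak_indices, features).

    features[t] is (cm[t] / last active peak, readings since that peak),
    or None while no active peak has been confirmed.
    """
    active = []
    features = [None] * len(cm)
    last_trough = None  # amplitude of the most recent local trough
    last_active = None  # (index, amplitude) of the most recent active peak
    for t in range(len(cm)):
        if t >= 2:
            prev2, prev1, cur = cm[t - 2], cm[t - 1], cm[t]
            if prev2 >= prev1 and prev1 < cur:
                last_trough = prev1  # local trough at t - 1
            elif prev2 < prev1 and prev1 >= cur:
                peak = prev1  # local peak at t - 1
                is_active = (
                    last_active is None
                    or peak > last_active[1]
                    or (
                        last_trough is not None
                        and peak > s_a * last_active[1]
                        and peak - last_trough > s_b * peak
                    )
                )
                if is_active:
                    last_active = (t - 1, peak)
                    active.append(t - 1)
        if last_active is not None:
            features[t] = (cm[t] / last_active[1], t - last_active[0])
    return active, features


# Synthetic amplitudes: the small bump at index 4 is a local fluctuation.
peaks, feats = track_active_peaks([1, 2, 5, 4, 4.2, 3, 6, 2])
```

In this synthetic series the peak at index 4 exceeds half of the last active peak but rises only 0.2 above its paired trough, which is less than 10% of its own amplitude, so it is treated as a local fluctuation.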
In addition to these primary features (derived from the CM at 500 Hz and ANN at 1 kHz), we determined the same for their harmonics at different frequencies (CM at 1 kHz, 1.5 kHz, and 2 kHz and ANN at 2 kHz, 3 kHz, and 4 kHz). We also considered a window of 3 time points (starting at 2 points prior to the current time point) and used the features calculated at these points in the feature set in order to test if previous readings would contribute to the detection of trauma. As such, we considered 96 features in total for each time point.
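One reading that is consistent with this total, offered as an assumption rather than a stated breakdown, is that each of the 8 feature types is computed for 4 CM/ANN frequency pairs (the fundamental pair plus 3 harmonic pairs) at each of 3 time points:

```latex
\[
N_{\text{features}}
  = \underbrace{8}_{\text{feature types}}
    \times \underbrace{4}_{\text{frequency pairs}}
    \times \underbrace{3}_{\text{time points}}
  = 96
\]
```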

Data normalisation
The potentials recorded with ECochG vary substantially between patients and are often normalised [21,22]. We used the values of the initial ECochG readings as a baseline to normalise amplitude data. To this end, we calculated the mean of the CM/ANN amplitude of the first 5 data points and subtracted this value from every subsequent amplitude reading. We used the sine of the phase angle for CM and ANN to ensure continuity of data. For each fold of the 5-fold cross validation, we then normalised each feature (in both training and test sets) to be in the range [0, 1] using the minimum and maximum values of the corresponding feature in the training data.
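The three normalisation steps can be sketched as follows. The function names are hypothetical. Two points are assumptions: the baseline is subtracted from every reading in the recording (including the first 5), and a feature that is constant in the training data is mapped to 0.

```python
import math


def subtract_baseline(amplitudes, n_baseline=5):
    """Subtract the mean of the first n_baseline readings from every reading."""
    baseline = sum(amplitudes[:n_baseline]) / n_baseline
    return [a - baseline for a in amplitudes]


def phase_feature(phase_rad):
    """Sine of the phase angle, which is continuous across the -pi / +pi wrap."""
    return math.sin(phase_rad)


def fit_min_max(train_column):
    """Minimum and maximum of one feature, taken from the training data only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lo, hi):
    """Scale a feature with training-set limits; test values may fall outside [0, 1]."""
    if hi == lo:
        return [0.0 for _ in column]  # constant feature in training data (assumed handling)
    return [(v - lo) / (hi - lo) for v in column]


centred = subtract_baseline([1, 2, 3, 4, 5, 8])
lo, hi = fit_min_max([2, 4, 6])
train_scaled = apply_min_max([2, 4, 6], lo, hi)
test_scaled = apply_min_max([8], lo, hi)
```

Taking the limits from the training data alone means that no information from the test fold enters the scaling, at the cost of test values occasionally lying outside [0, 1].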

Drop classification
We trained and tested several classification methods using the above features in order to detect if a drop in the ECochG signal occurred at a given time point. The classifiers thus trained were: tree ensembles (TE) [23], discriminant analysis (DA) [24], naive Bayes (NB) [25], support vector machines (SVM) [26], k-nearest neighbours (KNN) [27], and neural networks (NN) [28].
We used different variants of these classifiers. To create tree ensembles, we used adaptive boosting (AdaBoost) [29], random under-sampling boosting (RUS) [30], and bootstrap aggregation, or bagging (Bag) [31] algorithms. For discriminant analysis, we used linear and quadratic functions. In naive Bayes, we used 2 different ways of calculating the probability density function: using a normal distribution (Gaussian) and kernel density estimation (Kernel). When using support vector machines, we employed linear, quadratic, cubic, and Gaussian kernel functions. We considered 3 variations of the Gaussian kernel with different σ values (Fine: 1.1, Medium: 4.5, and Coarse: 18). In k-nearest neighbour classification, we used different numbers of neighbours and distance functions (Fine, Medium, and Coarse: Euclidean distance with 1, 10, and 100 neighbours respectively, Cosine and Cubic: 10 neighbours with Cosine and Minkowski distances respectively, and Weighted: 10 neighbours with weighted Euclidean distance). We defined our artificial neural network as a feedforward (FF) network with one hidden layer of 10 neurons.

Post processing
To identify obviously misclassified observations, we implemented a post-processing algorithm. This algorithm compared the current data (and the corresponding classification) to that seen previously in the same patient. The benefits of this strategy are twofold: 1) it can be implemented in real-time and 2) the variability of the between-patient CM amplitudes due to variations in natural hearing (which is necessarily present in the classification stage) can be avoided. The following conditions were used to correct misclassification of points.
• If the previous point on a falling edge (of ‖CM‖) was classified as a 'drop', the current point is also a 'drop'.
• If a point on a falling edge is classified as a 'drop' but the standard deviation of ‖CM‖ in a window including that point and the 4 previous points is less than s_c times the minimum ‖CM‖ in that window, it is a 'no drop'.
The scalar s_c was determined through trial and error to be 0.01. Fig 4 shows the results of this algorithm for a patient in the test set of one of the folds in our cross-validation dataset.
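The two correction rules can be sketched as a single pass over one patient's readings. Several details are assumptions, since they are not specified above: the first rule uses the corrected label of the immediately preceding reading when that reading was also on a falling edge, the second rule is applied after the first and takes precedence, the population standard deviation is used, the window is shortened at the start of a recording, and readings not on a falling edge are always 'no drop'. The function name is hypothetical.

```python
import math


def post_process(cm, raw_labels, s_c=0.01, window=5):
    """Correct per-reading classifier labels (True = 'drop') for one patient."""
    corrected = []
    for t, label in enumerate(raw_labels):
        falling = t > 0 and cm[t] < cm[t - 1]
        if not falling:
            corrected.append(False)  # only falling-edge readings can be drops
            continue
        prev_falling = t > 1 and cm[t - 1] < cm[t - 2]
        # Rule 1: a drop continues along the falling edge.
        if prev_falling and corrected[t - 1]:
            label = True
        # Rule 2: a 'drop' on an almost flat signal is relabelled 'no drop'.
        if label:
            values = cm[max(0, t - window + 1):t + 1]
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            if std < s_c * min(values):
                label = False
        corrected.append(label)
    return corrected


# Synthetic examples: a steady fall, and a nearly flat signal with a spurious 'drop'.
steady = post_process([10, 9, 8, 7, 6, 5], [False, False, True, False, False, False])
flat = post_process([10, 10.01, 10.0, 9.99, 9.98, 9.97],
                    [False, False, False, False, False, True])
```

Because each decision uses only the current and earlier readings from the same patient, the pass can run alongside the classifier during insertion.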

Performance metrics
In order to evaluate the performance of the classification methods, we used the commonly used metrics of sensitivity (the ability to correctly detect 'drops') and specificity (the ability to correctly detect 'no drops') [32]. We also calculated the overall accuracy of the classification. Eq 2 shows how these metrics were calculated.
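For reference, the standard definitions of these metrics are given here, treating 'drop' as the positive class. TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives; this notation is assumed here and may differ from that of Eq 2.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```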

Feature correlation
We calculated the pair-wise linear correlation between features to determine if there were any dependencies between them. We observed that the Pearson's correlation coefficient (r) between features was negligible (|r| < 0.25) for 22 feature pairs, weak (0.25 ≤ |r| < 0.5) for 4 feature pairs, and moderate (0.5 ≤ |r| < 0.75) for 2 feature pairs. Fig 6 shows the correlations between feature pairs. Since there were no strong correlations (|r| > 0.75) between features, we used all 8 features to train our classifiers.
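The pair-wise analysis can be sketched as follows. The 8 features give 28 distinct pairs, which matches the 22 + 4 + 2 pairs reported above. The function names are hypothetical, and assigning a value of exactly 0.75 to the strong category is an assumption, since the ranges above leave that boundary open.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Category of |r| using the thresholds given in the text."""
    a = abs(r)
    if a < 0.25:
        return "negligible"
    if a < 0.5:
        return "weak"
    if a < 0.75:
        return "moderate"
    return "strong"  # |r| == 0.75 is placed here by assumption


n_pairs = len(list(combinations(range(8), 2)))  # number of distinct feature pairs
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```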

Selection of a classification method
We first trained the above-mentioned classifiers on the primary features at time t (not considering features derived from the harmonics or those from previous time points) in order to select the best classifier for our purpose. As the dataset was unbalanced, we biased the training towards the detection of drops by increasing the penalty for misclassifying drops. As an initial value for this penalty term, we used the ratio between the number of 'no drop' and 'drop' instances (∼22). Test results of the classification (the average of the 5-fold cross validation) are shown in Table 1. Four classifiers (AdaBoost tree ensemble, support vector machines with quadratic and Gaussian (σ = 18) kernels, and k-nearest neighbours with Euclidean distance and 100 neighbours) showed high performance levels (>0.9 for all 3 performance metrics). From these 4 classifiers, we selected the one with the highest accuracy (AdaBoost tree ensemble) to be used in the next stages of the process.

Refinement of classifier
We observed that training the classifier on the full set of 96 features (including features from harmonics as well as the full time window) did not significantly improve classification results (0.8971, 0.9563, and 0.9537 for sensitivity, specificity, and accuracy respectively). Therefore, only the primary features were used in the classification.
We determined the ideal misclassification cost (with respect to sensitivity and specificity) using an iterative search. Fig 7 shows the results of this search. We selected 41 as the best misclassification cost as it maximised both sensitivity and specificity. The final results were 0.9356, 0.9484, and 0.9478 for sensitivity, specificity, and accuracy respectively. The average prediction time per instance (for peak-detection, classification, and post-processing) was ∼0.75 ms.
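An iterative search of this kind can be sketched as follows. The selection criterion is an assumption: the sketch picks the cost that maximises the lower of sensitivity and specificity, which is one way of favouring both at once, and may not be the criterion behind Fig 7. The `evaluate` callback is hypothetical and stands in for cross-validated training and testing at a given cost; the toy curves are synthetic and do not reproduce the reported results.

```python
def select_cost(costs, evaluate):
    """Return the candidate cost with the highest min(sensitivity, specificity).

    evaluate(cost) must return a (sensitivity, specificity) pair; it is a
    hypothetical stand-in for cross-validated training and testing.
    """
    best_cost, best_score = None, -1.0
    for cost in costs:
        sensitivity, specificity = evaluate(cost)
        score = min(sensitivity, specificity)
        if score > best_score:
            best_cost, best_score = cost, score
    return best_cost


def toy_evaluate(cost):
    """Synthetic trade-off: sensitivity rises and specificity falls as cost grows."""
    sensitivity = min(1.0, 0.5 + 0.01 * cost)
    specificity = 1.0 - 0.004 * cost
    return sensitivity, specificity


chosen = select_cost(range(1, 61), toy_evaluate)
```

Raising the cost of missing a drop pushes the classifier towards labelling more readings as drops, so sensitivity and specificity move in opposite directions and the search looks for the point at which neither is sacrificed.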

Feature importance
We obtained the importance of the different features for the selected classifier. To this end, we calculated the average weights across the 5 folds for each feature. The feature that explained about 47% of the classification results was the ratio of ‖CM‖ to the previous peak. The coefficient of variation of ‖CM‖ and the time from the previous peak contributed similarly (∼19% and ∼17% respectively). ‖CM‖, ‖ANN‖, and ‖CM‖:‖ANN‖ accounted for about 6%, 5%, and 4% respectively. ϕ(CM) and ϕ(ANN) accounted for only ∼2% and ∼1% of the results respectively.

Discussion
Intra-operative monitoring of inner-ear health during cochlear implantation is a rapidly advancing method for predicting the preservation of residual hearing [11,12]. Thus far, these methods have been largely observational, and when used to modify surgical approach, have been applied as ad-hoc, surgeon-driven processes without consistency of approach or a dedicated tool [33]. The common end-goal of clinical research into intra-operative ECochG is to automatically and rapidly provide feedback in the operating theatre, removing the need for an expert observer. To advance this approach, we introduced here a framework for automatically detecting insertion trauma using a machine learning algorithm. Prior attempts at improving the sensitivity and specificity of trauma detection with intraoperative ECochG have used human-picked features (for example, Weder et al. [34]), with a maximum sensitivity and specificity of 89% and 69% respectively. The approach used here is sensitive and specific at matching the drops identified by a human expert (∼94% and ∼95% respectively), which provides a foundation for use of the algorithm in the clinic. The precise features of the complex ECochG signal that provide the highest accuracy in detecting trauma are under evolving debate, with some improvements shown when including features such as CM latency [35] and ANN:CM ratio [36]. In the model developed here, including these features resulted in only a small improvement in detection rate (<10%).
In earlier studies optimising trauma detection, for example, in [36], typically an observational approach was used, comparing the peak CM response immediately prior to a drop to that at the nadir of the drop. However, at this point, it may be too late to initiate intervention. The goal of the present study was to facilitate the provision of rapid, intra-operative feedback as and when a drop occurs. The speed of detection of the proposed method is suitable for this purpose (∼0.75 ms on a computer with a 2.11 GHz processor). Rapid real-time feedback not only leads to quicker intervention but also assists in consistent and smoother insertion, which has been shown to reduce trauma [37].
While this study improved the accuracy of CM drop detection over a basic assessment of CM amplitude, it still relied on human assessment as the ground truth. It remains to be seen if methods based on classification such as that explored here, or other machine learning methods (such as change detection and recurrent neural networks), could be utilised to surpass the human observer in detecting traumatic events.
Although the use of cross-validation ensured the robustness of the models, they were trained/tested on a relatively small dataset. To ensure their applicability on a wide range of patients, they need to be retrained on larger datasets as and when they become available. Also, limits of performance where the models may fail need to be explored and safeguards implemented to handle such situations, prior to practical use in the operating theatre.

Conclusion
In this paper, we introduced a framework for detecting trauma during cochlear implant surgery using ECochG data. The process consisted of 3 steps: feature detection, classification, and post-processing. All algorithms were specifically designed to enable real-time trauma detection. We achieved high performance: ∼94% sensitivity and ∼95% specificity. The average prediction time was ∼0.75 ms, indicating that the method is viable for use during surgery. We anticipate that the inclusion of automatic trauma detection in ECochG will improve its utility and scalability.